DOCUMENT RESUME

ED 271 483                                              TM 860 305

AUTHOR          Littlefield, John H.; Troendle, G. Roger
TITLE           Rating Format Effects on Rater Agreement and
                Reliability.
PUB DATE        Apr 86
NOTE            10p.; Paper presented at the Annual Meeting of the
                American Educational Research Association (70th, San
                Francisco, CA, April 16-20, 1986).
PUB TYPE        Speeches/Conference Papers (150) -- Reports -
                Research/Technical (143)
EDRS PRICE      MF01/PC01 Plus Postage.
DESCRIPTORS     *Cognitive Processes; *Dental Evaluation; *Dental
                Schools; Evaluation Criteria; Higher Education;
                *Interrater Reliability; Judges; Measurement
                Techniques; *Medical School Faculty; *Rating Scales;
                Stimuli



ABSTRACT 

This study compares intra- and inter-rater agreement 
and reliability when using three different rating form formats to 
assess the same stimuli. One format requests assessment by marking 
detailed criteria without an overall judgement; the second format 
requests only an overall judgement without the use of detailed 
criteria; and the third format combines detailed criteria with an 
overall judgement. Results are interpreted from a cognitive 
processing theoretical framework. Subjects were five full-time and 
three part-time dental faculty members. The experimental task was to 
evaluate five crown preparations during six trials using each of 
three different rating forms, but raters were not informed they were 
reevaluating the same teeth. Raters were assigned code numbers to 
maintain anonymity, and teeth were identified only by code numbers. 
Data analysis was based upon ratings of five teeth from trials one 
through six; the trials were six weeks apart. Inter-rater agreement 
among the eight raters was distressingly low, but was in the range of 
one previous report. The study suggests that the traditional practice 
of scoring performance ratings by summary across multiple criteria 
may reduce intra-rater reliability. Rating forms which are structured 
to parallel rater cognitive processes may result in more reproducible 
scores than traditional summation scoring methods. (LMO) 
Rating Format Effects on Rater Agreement and Reliability 
Background 

Summing across multiple items to yield a single total score is the 
traditional scoring method on rating forms used in education and industry. 
This practice is based on psychometric theory which recognizes that individual 
items have considerable specificity and measurement error (Nunnally, 1978). 
This scoring method may not be appropriate for performance ratings because 
a third party, the rater, produces the scored responses for the individual 
being assessed. The rater's information processing serves as a cognitive filter 
of the measurement data (Landy & Farr, 1980). If we hope to increase the 
validity of performance ratings, we must learn more about how raters 
observe, encode, store and retrieve information marked on rating forms. 

In describing the cognitive process of performance ratings, Feldman 
(1981) hypothesizes that raters can attend to a particular stimulus 
configuration without conscious monitoring. He suggests that stimuli are 
categorized into fuzzy sets which are not defined by necessary and sufficient 
sets of attributes. A given stimulus (performance) is categorized based upon 
the extent to which it overlaps features of a rater's category prototype (e.g., 
young ambitious energetic employee). If a stimulus does not automatically fit 
a category prototype, a consciously controlled process will supersede the 
automatic process. Both the automatic and the consciously controlled 
processes are based upon a prototype-matching operation. 

Human judgmental heuristics and knowledge structures (Nisbett & Ross, 
1980) undoubtedly affect the cognitive process of performance appraisal. The 
perceiver is not a dutiful clerk who passively registers items of information. 
Instead, human perceivers actively interpret incoming perceptual data and 
form inferences about associations and causal relations. Faculty who rate 
students are experts in the tasks to be evaluated. As experts, faculty have 
complex schematic cognitive structures which provide an interpretive 
framework for making judgments about student performance. These cognitive 
structures may not directly correspond to the detailed rating criteria listed on 
a rating form. 

Given that faculty who rate students are experts at the tasks to be judged, 
it seems unlikely that they strictly attend to the directions of a traditional 
performance rating form (i.e., judge criterion 1, judge criterion 2, ... sum the 
criterion scores). Instead, it seems more likely that they match the stimulus 
performance against their own preconceived category prototypes and then 
make a global judgment. Marking detailed criteria may occur in conjunction 
with the global judgment, but would not necessarily precede it as implied by 
summing the criterion scores to yield a total score. 

In summary, traditional performance rating forms are not structured to 
parallel the hypothetical cognitive processes used by experts to make 
judgments (e.g., raters are not asked to make an overall judgment about the 
performance). Instead, the rating form is constructed by logically analyzing 
[...] components may be important in helping a learner to analyze the multiple 
steps in a task, but they are not necessarily useful criteria for expert raters 
evaluating a stimulus performance. 

This study compares intra- and inter-rater agreement and reliability when 
using three different rating form formats to assess the same stimuli. One 
format requests assessment by marking detailed criteria without an overall 
judgment. The second format requests only an overall judgment without the 
use of detailed criteria. The third format combines detailed criteria with an 
overall judgment. Results are interpreted from a cognitive processing 
theoretical framework. 



Methods 

Subjects were five full-time and three part-time dental faculty members. 
They ranged in age from 28 to 60 years. All subjects had 2 or more years of 
clinical teaching experience in the Division of Crown and Bridge and had 
participated in construction of the detailed rating criteria used in this study. 

The rating task in this experiment is a routine part of the subjects' daily 
responsibilities. The experimental task was to evaluate five 3/4 crown 
preparations twice using each of three different rating forms: 

1. Form CrC - a 19-item criterion checklist in which raters marked each 
criterion on a 3-category scale (acceptable, needs improvement, or 
unsatisfactory). A single composite score was calculated ex post facto 
by summing the marks on the 19 individual criteria (typical checklist). 

2. Form GJ - a global judgment on a 5-point scale (0-4) with no detailed 
criteria. 

3. Form Com - a combination of the 19-item criterion checklist (Form 
CrC) plus global judgment (Form GJ). The rater marked the individual 
criteria and also made a global judgment on a 4-point scale (0, 2, 3, 4). 

Appendix 1 is a sample of Form Com. Note that the grading code allows 3 
"I" ratings (improvement needed) to receive a grade of "2" while 4 "I" ratings 
result in a grade of "0" (failure). The omission of a "1" in the grade code 
reflects an evaluation philosophy which requires a satisfactory performance 
level to attain clinical acceptability. The occurrence of 1 "U" rating or 4 "I" 
ratings results in a judgment of "failure" (0) for the crown preparation. 
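
The grading code can be sketched as a small function. This is only an illustration, assuming the Appendix 1 code reads as 0-1 "I" = 4, 2 "I" = 3, 3 "I" = 2, 4 "I" = 0, and 1 "U" = 0; the function name and the handling of more than 4 "I" or more than 1 "U" ratings (also scored 0 here) are assumptions, not part of the original form.

```python
def competency_grade(marks):
    """Map a list of 'A'/'I'/'U' criterion marks to a 4, 3, 2, 0 grade.

    Assumed reading of the Appendix 1 grading code:
    0-1 "I" -> 4, 2 "I" -> 3, 3 "I" -> 2, 4 "I" -> 0, 1 "U" -> 0.
    More than 4 "I" or more than 1 "U" is also treated as failure
    (an assumption; the form does not say).
    """
    n_unsatisfactory = marks.count("U")
    n_improvement = marks.count("I")
    if n_unsatisfactory >= 1 or n_improvement >= 4:
        return 0  # failure; note that no grade of 1 exists on this scale
    if n_improvement == 3:
        return 2
    if n_improvement == 2:
        return 3
    return 4  # zero or one "I" rating


# Hypothetical checklist: 16 acceptable marks and 3 "I" marks.
example = ["A"] * 16 + ["I"] * 3
grade = competency_grade(example)
```

The jump from 2 to 0 with no intermediate grade is what makes the scale discontinuous.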

Table 1 summarizes the design of the study. 
Table 1 - Design of the Study

Rating Form      Criterion Checklist    Global Judgment    Combination
                 (CrC)                  (GJ)               (Com)

Trial Number       1        2             3        4         5        6

# of Raters        8        8             8        8         8        8

# of Teeth        15        5             5        5         5        5


Data collection procedures were described in detail by Troendle (1983). 
Raters were assigned code numbers to maintain anonymity. Fifteen crown 
preparations (teeth) were evaluated during trial one as shown in Table 1. For 
trials 2 through 6, five teeth were selected based upon the trial 1 ratings: a. 
two teeth that were easy to evaluate (high inter-rater agreement), b. two that 
were difficult to evaluate (low agreement), and c. one tooth that was of 
intermediate difficulty. Raters were not informed that they were 
re-evaluating the same five teeth. Teeth were identified only by code numbers 
and at least six weeks intervened between each trial session. Data analysis 
was based upon ratings of five teeth from trials 1 through 6. 

Three types of scores were available for analysis: 

1. Ratings on detailed criteria (Forms CrC and Com only) 

2. Summated scores calculated by assigning 2, 1, or 0 to each A, I, or 
U rating on the detailed criteria then summing across the 19 
detailed criteria (Forms CrC and Com only). 

3. Competency-based scores using the 4, 3, 2, 0 grading code shown in 
Appendix 1. Subjects provided this score when using Forms Com 
and GJ while the authors calculated it for Form CrC. 
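
As a brief illustration of the summated score, the sketch below assigns 2, 1, or 0 to each A, I, or U mark and sums across the 19 criteria. The function name and the example marks are hypothetical; only the point values and the number of criteria come from the text.

```python
# Point values stated in the text: A = 2, I = 1, U = 0.
POINTS = {"A": 2, "I": 1, "U": 0}
N_CRITERIA = 19


def summated_score(marks):
    """Sum the point values of the 19 detailed-criterion marks."""
    if len(marks) != N_CRITERIA:
        raise ValueError("expected one mark for each of the 19 criteria")
    return sum(POINTS[mark] for mark in marks)


# Hypothetical checklist: 16 "A", 2 "I", and 1 "U".
example_marks = ["A"] * 16 + ["I"] * 2 + ["U"]
example_total = summated_score(example_marks)  # 16*2 + 2*1 + 1*0 = 34
```

The possible range is 0 to 38, so a single "U" lowers the summated score by only 2 points, whereas it produces a failing competency-based score.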

The term competency-based was used to signify the discontinuous score scale 
inherent in the clinical acceptability evaluation philosophy described above. 
Scores from Form GJ were classified as competency-based because this grading 
procedure is routinely used in the Dental School and was familiar to the raters. 
Table 2 summarizes the types of scores available for each rating format. 



Table 2 - Three Types of Rating Data

                          Form CrC    Form GJ    Form Com

Ratings of
Detailed Criteria           Yes         No         Yes

Summated
Scores                      Yes         No         Yes

Competency-
based Scores                Yes         Yes        Yes



Rating data were analyzed to answer four questions related to intra- and 
inter-rater agreement and reliability. Agreement was analyzed only on the 
ratings of detailed criteria (see Table 2) and was defined as identical ratings 
on a criterion. Two rater agreement questions were addressed: 

1. Intra-rater agreement on the detailed criteria? 

2. Inter-rater agreement on the detailed criteria? 

Intra- and inter-rater agreement on the detailed criteria were assessed using 
a tau coefficient suggested by Tinsley and Weiss (1975). It is a chi-square 
test to ensure that observed agreement exceeds chance levels, followed by 
calculation of percent agreement adjusted down for chance agreements. 
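
The two-step procedure can be sketched as follows. The formulas are an assumption, since the paper does not state them: chance agreement is taken as P = 1/k for k rating categories, the index as T = (Na - N*P) / (N - N*P) where Na is the number of identical ratings out of N, and the chi-square statistic (1 degree of freedom) compares observed with chance-expected agreements and disagreements. All names and data are hypothetical.

```python
def chance_adjusted_agreement(first, second, n_categories=3):
    """Return (raw proportion agreement, adjusted index T, chi-square).

    Assumes chance agreement P = 1 / n_categories; this is an
    illustrative reading, not a formula given in the paper.
    """
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty rating lists of equal length")
    if n_categories < 2:
        raise ValueError("need at least two rating categories")
    n = len(first)
    agreements = sum(a == b for a, b in zip(first, second))
    expected = n / n_categories  # agreements expected by chance
    # Chi-square over the two cells: agreements and disagreements.
    chi_square = ((agreements - expected) ** 2 / expected
                  + ((n - agreements) - (n - expected)) ** 2 / (n - expected))
    t_index = (agreements - expected) / (n - expected)
    return agreements / n, t_index, chi_square


# Hypothetical marks from one rater on the same criteria in two trials.
trial_a = list("AAAIIIUUU")
trial_b = list("AAAIIIUUA")
raw, adjusted, chi2 = chance_adjusted_agreement(trial_a, trial_b)
```

Here 8 of 9 marks agree (raw agreement about .89), while the adjusted index is lower (about .83) because 3 agreements would be expected by chance alone.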

Reliability was defined as the degree to which the overall scores (both 
summated and competency-based in Table 2) are proportional when 
expressed as deviations from the judges' mean score. Two rater reliability 
questions were addressed: 

1. Intra-rater reliability on the overall scores? 

2. Inter-rater reliability on the overall scores? 

Intra-rater reliability on the overall scores was assessed using a Pearson 
product-moment correlation coefficient and inter-rater reliability was 
assessed using an intraclass correlation coefficient (Finn, 1970). Statistical 
significance of differences among correlation coefficients was assessed using a 
multiple comparison test suggested by Marascuilo (1966). 
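
Both coefficients can be sketched in a few lines. The Pearson function is the standard product-moment formula. The second function is an assumed reading of Finn's (1970) coefficient, r = 1 - (mean within-target variance) / (chance variance), with the chance variance of a k-point scale taken as (k^2 - 1) / 12; the paper does not give its computational formula, and all names and data below are hypothetical.

```python
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant scores")
    return sxy / (sxx * syy) ** 0.5


def finn_r(ratings_by_target, n_categories):
    """Assumed form of Finn's r: 1 - within-target variance / chance variance."""
    within = mean(variance(ratings) for ratings in ratings_by_target)
    chance = (n_categories ** 2 - 1) / 12  # variance of a uniform k-point scale
    return 1 - within / chance


# Hypothetical intra-rater data: one rater's scores on 5 teeth in two trials.
first_trial = [4, 3, 0, 2, 4]
second_trial = [4, 2, 0, 3, 4]
intra = pearson_r(first_trial, second_trial)

# Hypothetical inter-rater data: 3 raters scoring each of 2 teeth.
inter = finn_r([[4, 3, 4], [2, 2, 3]], n_categories=5)
```

With only five teeth per rater, a single changed score moves the Pearson coefficient substantially, which is worth keeping in mind when reading the coefficients in Table 3.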

Results 

Table 3 presents a summary of the data analysis results. 

Table 3 - Data Analysis Results

                                   Form CrC    Form GJ    Form Com

Agreement on Detailed Criteria
  1. Intra-rater                     75%                    86%
  2. Inter-rater

Reliability of Summated Scores
  1. Intra-rater                    [.59]                  [.83]
  2. Inter-rater                     .50                    .5?

Reliability of Comp.-based Scores
  1. Intra-rater                    [.55]        .73       [.88]
  2. Inter-rater                     .51         .63        .5?

[ ] - brackets denote significant differences (p<.05)
" - does not exceed chance agreement



Inter-rater agreement on Form (9%) did not exceed chance level. 
Intra-rater reliability coefficients for Forms CrC, GJ, and Com using both 
competency-based and summated scores were significantly different (p<.05). 
Form CrC coefficients differed from Form Com for both types of scores as 
shown by the brackets in Table 3. Inter-rater reliability coefficients 
associated with each Form were not statistically different. 

Discussion 

The intra-rater agreement levels in Table 3 parallel previous reports 
(Houpt & Kress, 1973). Inter-rater agreement among the 8 raters is 
distressingly low (i.e., did not exceed chance levels), but is in the general 
range of one previous report (Natkin & Guild, 1973). When inter-rater 
agreement and intra-rater reliability results are viewed together, this study 
suggests that scores are more reproducible when the rating format requests a 
global judgment in conjunction with marking detailed criteria (Form Com). 
This is not the scoring procedure traditionally used with performance ratings 
(Form CrC). 

Form CrC is typical of checklists which assume that expert raters evaluate a 
performance by assessing each detailed criterion individually while marking 
the corresponding blanks on the form. Summing the marks to get a single 
composite score could be delegated to a computer. However, recent reports 
on the role of prior knowledge in comprehension of medical information 
showed that experts made more inferences on high relevance information 
than either novices or intermediates (Patel et al., 1984). The raters in this 
study were experts at the task of preparing teeth for crown restorations. In 
judging a crown preparation, it seems likely that they would form an overall 
judgment based upon high relevance information in conjunction with 
assessing each of the detailed criteria. Cognitive processes of the kind 
described by Cooper (1981) would influence a rater toward marking the 
detailed criteria to conform with his/her initial overall judgment. In 
summary, the structure of Form Com encourages an overall judgment and 
therefore parallels the sequence of rater cognitive processes more closely than 
Form CrC. The parallel structure between Form Com and rater cognitive 
processes may have resulted in the improved intra-rater reliability for Form 
Com under both summated and competency-based scoring methods. 

Mackenzie et al. (1982) describe the results of a detailed investigation of 
rater errors using dental performance checklists similar to those in this study. 
They conclude that "... clearly defined unambiguous checkpoints are probably 
the most important factors in producing reasonable agreement among 
evaluators". Mackenzie et al. also note that a criterion used to judge dental 
products is often based upon an opinion without validating whether clinical 
usefulness is impaired when that criterion is not satisfactorily achieved. The 
results of this study can be combined with the Mackenzie et al. conclusions to 
suggest guidelines for developing rating form criteria and scoring procedures. 

1. Develop clearly defined unambiguous checkpoints. 

2. Use only criteria which can be shown to impair clinical usefulness when 
not satisfactorily achieved. 

3. Use a scoring procedure which facilitates the ability of experts to 
distinguish between good and poor performances. 

4. Use global judgment scoring in combination with marking detailed 
criteria (Form Com) instead of summing across the individual criteria. 

In summary, the rating form should reflect cognitive processes used by 
experts judging the procedure rather than trying to train experts to use the 
form consistently. 

Construct psychology (Fransella & Bannister, 1977) offers a technology for 
identifying the actual criteria used by experts to make judgments. The basic 
approach is to show expert raters examples of good and bad performances 
then ask them how various pairs differ. The final results are constructs used 
by experts to distinguish among performances rather than logical steps used 
to teach the skill to a novice. The concept underlying identifying rater 
constructs is quite similar to the retranslation technique originally proposed 
to develop Behaviorally Anchored Rating Scales (Smith & Kendall, 1963), but it 
is less time consuming. 

This study is marred by at least four weaknesses. First, the raters knew 
the data were for research purposes. Landy and Farr (1980) noted that 
ratings for administrative purposes will be more lenient than those for 
research purposes. Raters in this study may have performed differently if 
the scores were to be used to determine student grades. A second weakness 
is the failure to use a randomized block design. One could argue that the 
raters learned the teeth in rating them six times. The teeth were numbered 
with different ink and tape for each trial and stored loosely in a box. Post hoc 
conversations with the raters did not indicate that they recognized the same 
crown preparations were being used repeatedly. A third problem is the use of 
parametric statistics with a discontinuous competency-based scoring scale 
(0, 2, 3, 4). This may have affected the size of the competency-based reliability 
coefficients; however, methodological studies of factor analyses with 
numerical scales that are not equal interval suggest that the correlation 
coefficients will not be substantially affected (Baggaley & Hull, 1983). In 
addition, the summated score reliability coefficients also support the 
superiority of Form Com (see Table 3). Finally, the fourth weakness is a 
failure to find significant differences among the inter-rater reliability 
coefficients. DiStefano (1981) has shown that intraclass correlation 
coefficients have a large sampling error when based upon relatively small 
sample sizes such as these (n=40 scores). 



Conclusions 

This study suggests that the traditional practice of scoring performance 
ratings by summing across multiple criteria may reduce intra-rater reliability. 
The results are consistent with a cognitive process of prototype matching in 
which the overall judgment made by an expert rater is an important part of 
the performance evaluation process. Rating forms which are structured to 
parallel rater cognitive processes (i.e., request experts to make an overall 
judgment somewhat independent of the detailed criteria) may result in more 
reproducible scores than traditional summation scoring methods. Detailed 
criteria are important to document the rationale which supports a rater's 
global judgment; however, it is not likely that criteria on a rating form 
mechanically structure the rater's judgment process as implied in traditional 
performance checklist directions and scoring. In short, the whole score may 
be different than the sum of the detailed criteria. 

Problems with low inter-rater agreement in marking detailed criteria on 
rating forms may be due to the use of inappropriate criteria. Logical steps 
used to teach students to perform a procedure may not be helpful to experts 
in making consistent discriminative judgments. Techniques from construct 
psychology are recommended to help identify criteria which experts use to 
distinguish between good and poor performance of a particular task. 

APPENDIX 1

FORM CR-X: THE CRITERION-REFERENCED FORM WITH A
COMPOSITE SCORE FOR TRIALS 5 AND 6

Code #
Tooth #
Date

Grading Code

0 - 1 I = 4
    2 I = 3
    3 I = 2
    4 I = 0
    1 U = 0

PREPARATION (3/4 Crown)
 A   I   U
[ ] [ ] [ ]  Axial reduction (over/under reduced)
[ ] [ ] [ ]  Occlusal reduction (over/under reduced)
[ ] [ ] [ ]  No sharp line angles present
[ ] [ ] [ ]  Two-plane reduction utilization
[ ] [ ] [ ]  Margin location
[ ] [ ] [ ]  Margin smoothness, continuity
[ ] [ ] [ ]  Margin type
[ ] [ ] [ ]  Occlusal convergence 2° - 5°
[ ] [ ] [ ]  Occlusal convergence less than 15°
[ ] [ ] [ ]  No undercuts present
[ ] [ ] [ ]  Position of proximal boxes slightly buccal to the middle
             of the tooth
[ ] [ ] [ ]  Proximal boxes in same line of draw
[ ] [ ] [ ]  Line of draw of boxes with the rest of the prep
[ ] [ ] [ ]  2° - 5° occlusal divergence of proximal boxes
[ ] [ ] [ ]  Blending of occlusal groove with proximal boxes
[ ] [ ] [ ]  Length of box
[ ] [ ] [ ]  Depth of box
[ ] [ ] [ ]  Margin below gingival floor of box
[ ] [ ] [ ]  Other*